Evaluation of genetic demultiplexing of single-cell sequencing data from model species

The authors demonstrate that scRNA-seq sample pooling followed by genetics-based separation of individuals is an effective means to identify individual samples in a variety of commonly studied species.


Introduction
Over the last decade, single-cell RNA sequencing (scRNA-seq) has exploded in popularity as a species agnostic tool for studying gene expression at the level of individual cells (Klein et al, 2015;Macosko et al, 2015;Villani et al, 2017;Han et al, 2018). The biggest impact has been on species in which study at the cellular level was long difficult or impossible because of lack of species-specific antibodies or other reagents (i.e., all species other than rodents and primates). Although extremely powerful, the scRNA-seq approaches that have risen to prominence are expensive and low throughput for biological replicates. This is a serious drawback as the lack of biological replicates has also been shown to be a major cause of false discoveries (Hicks et al, 2018;Squair et al, 2021). To limit artifacts in scRNA-seq, there is a critical need for approaches that allow for adequately powered experiments (Squair et al, 2021;Zimmerman et al, 2021). Sample pooling is an effective means to increase biological replicate throughput while simultaneously decreasing batch effects and costs. Sample pooling can also enable scRNA-seq experiments with limited cell quantities and can identify doublets which, depending on cell loading, can make up a substantial portion of sequenced droplets (DePasquale et al, 2019).
In fields working with low cell numbers, like developmental biology, pooling of samples from multiple animals with no sample labeling method, or intention of demultiplexing has become a standard practice. This approach lacks advantages of true replicates because there is no way to assess the data for representation of all samples or variation between samples. The inability to demux pooled samples thus lacks the ability to account for replicate variation and perform replicate strengthened differential expression analysis (Squair et al, 2021;Zimmerman et al, 2021). Because of these benefits of biological replicates, pooled samples in which biological replicates were collected would ideally be able to be demultiplexed, providing information on the origin of each cell in the experiment.
Methods for analyzing pooled data and for enabling the demultiplexing (also known as demuxing) of pooled scRNA-seq samples are varied in concept and accuracy and have been recently reviewed . To preserve the benefits of true biological replicates within pooled samples, experimental protocols for demultiplexing of pooled scRNA-seq samples have been developed. These methods include pooling cells from transgenic animals expressing a distinct transgene (Lin et al, 2021) or oligonucleotide (Shin et al, 2019). The most common method for scRNA-seq pooling is cell hashing, in which cells from each sample are labeled with antibodies (Stoeckius et al, 2018), lipids (McGinnis et al, 2019), or chemicals (Gehring et al, 2020) tethered to an oligonucleotide label that links gene expression data from each cell to the cellular origin (i.e., cell multiplexing oligonucleotide (CMO) label). A downside to these label-based approaches is that they each have varying degrees of efficiency, require sufficient cell numbers, have costs associated with their application, are not compatible with all species, and have a chance of failing .
In contrast to these experimental demultiplexing approaches, computational methods have been developed to demultiplex pooled human samples without any labeling regimen using the natural genetic differences between individuals. These approaches detect genetic differences between samples at sites of singlenucleotide polymorphisms (SNPs) and implement demultiplexing based on differential distributions of these SNPs between samples. SNP-based approaches have been benchmarked and shown to be highly effective at separating human samples (Kang et al, 2018;Huang et al, 2019;Xu et al, 2019;Heaton et al, 2020;Weber et al, 2021). Relative to laboratory species, SNP-based computational approaches are expected to perform well in human samples because of their high genetic diversity and wealth of available genomic resources. Outside of human samples, SNP-based demultiplexing has been applied to demultiplex other species: plasmodium samples, and across mouse strains (Heaton et al, 2020;Mylka et al, 2022). Although these results are promising, thorough evaluation of the strengths, weaknesses, and limitations of SNP demuxers in model organisms-through comparison of results to ground truths derived from orthogonal wet laboratory-based methods-is required to demonstrate whether this is an approach that warrants widespread use.
In this project, we set out to learn whether SNP-based demultiplexers work in an array of non-human species. We benchmarked available SNP-based demultiplexing programs and found that most are highly accurate on model organism datasets. We then selected the demultiplexing tool with the broadest potential usability, souporcell (Heaton et al, 2020), to more thoroughly test against a diverse range of species and wet laboratory-based multiplexing methods. We found that SNP-based demuxers can successfully demultiplex single-nuclei RNA-seq and scRNA-seq datasets without any prior information about the samples and with only a de novo transcriptome as a reference. This suggests that SNP-based demuxers can facilitate effective experimental design via the demultiplexing of pooled scRNA-seq data from a vast range of species. This lowers burdens and substantially broadens the use of scRNA-seq to study cellular processes in most organisms.

Results
Highly accurate SNP-based demultiplexing of in silico pooled sc-seq from zebrafish, monkey, and axolotl We first explored the performance of SNP-based demultiplexing methods when applied to a published zebrafish dataset. Two useful resources for applying SNP-based demultiplexing are available for zebrafish, a high-quality genome (Howe et al, 2013), and a common SNP variant, variant call format (VCF) file for the species (LaFave et al, 2014). A recent study collected single animal scRNA-seq datasets from the thymus of zebrafish (Rubin et al, 2022). These data present an opportunity to synthetically pool samples to test SNP-based demultiplexing on data from a non-human species. Although synthetically pooled data are less challenging for an SNPbased demultiplexing algorithm than experimentally pooled samples, the individually sequenced libraries provide a ground truth for definitive benchmarking of SNP-based assignments.
After processing each zebrafish scRNA-seq sample individually, we performed in silico pooling of three samples (Fig 1A). This synthetic pooling creates a pooled sample that has cell origin information and synthetic doublets. The production of synthetic doublets bolsters the ability to test and validate the accuracy of cell and doublet assignments. The synthetically pooled data were then demultiplexed with the SNP-based demuxers souporcell (Heaton et al, 2020), Vireo (Huang et al, 2019), Freemuxlet (https:// github.com/statgen/popscle; Kang et al, 2018), and scSplit (Xu et al, 2019)-all of which do not require prior genotyping information. We found that souporcell, Vireo, and Freemuxlet gave highly concordant results, with scSplit giving divergent assignments for one individual (Fig S1A and B). The computational requirements for this dataset were similar between these tools, with Freemuxlet finishing the quickest and scSplit requiring the least memory (Table S1). We also ran souporcell on this dataset with and without providing the common SNPs VCF file. We found that for this zebrafish dataset, the use of the common VCF file had minimal overall effect on the souporcell demultiplexing results, with less than 1% of cell assignments differing (11/1,312 cells). Given the highly similar results and compute requirements from souporcell, Vireo, and Freemuxlet, we chose to move forward with more thorough analysis of souporcell because this tool has the broadest universal applicability for samples from traditional and non-traditional model species due to its lesser input requirements (Fig S2).
To further investigate the demultiplexing accuracy of souporcell, SNP-based cell assignments were assessed for correlations with ground truth animal origin. We found a strong agreement between souporcell assignments and ground truth animal identities (Fig 1B and C). We then analyzed how many cells of each known animal origin were assigned to each animal by souporcell (Fig 1C, left as total cell quantities, right as percentages). SNP-based demultiplexing of this synthetically pooled zebrafish scRNA-seq data was highly accurate, resulting in correct cell assignments of 99-100% of cells based on their known animal origin ( Fig 1C). This suggests that genetic demultiplexing is a viable means to enable sample pooling and subsequent demultiplexing in zebrafish.
The presence of doublets in single-cell RNA sequencing is a major confounder. A doublet is a droplet represented by a singlecell barcode that contains more than one cell. In these in silico pools, true doublets can be identified with absolute certainty because we have origin information. When comparing SNP-based demultiplexing results to ground truth, "confirmed doublets" are cells that were assigned doublets by both the ground truth and demuxer. Furthermore, "contested doublet" refers to cells in which the experimentally derived ground truth and SNP-based demuxer result disagree about a potential doublet. We thus investigated the doublet detection capacity of souporcell for heterotypic true doublets (i.e., doublets from two genetically distinct individuals in a defined in silico pool). Homotypic true doublets created during synthetic pooling were removed, as souporcell relies on intergenotypic doublet detection. We found that souporcell missed almost half of the synthetic heterotypic true doublets in the pooled dataset ( Fig 1C). The relatively poor performance (55% heterotypic true doublets identified) of souporcell at identifying synthetic heterotypic doublets from this zebrafish dataset made doublet detection the largest discrepancy in this analysis. Souporcell was not alone in its poor doublet detection performance, with Freemuxlet and Vireo misassigning many of the same doublets as souporcell ( Fig S1B). This suggests that SNP-demultiplexing tools struggle with heterotypic doublet detection when only three animals are pooled. Figure 1. SNP-based demultiplexing enables demultiplexing of synthetically pooled zebrafish and African green monkey scRNA-seq data. (A) Conceptual diagram of benchmarking analysis for zebrafish data (n = 3). (B) Upset plot comparing cell assignments by souporcell to known animal identities for each cell. Souporcell assignments were matched with known identities by correlation analysis. (C) Bar plots quantifying the distribution of souporcell assignments for cells from each animal. Left: of cells known to originate from each animal, the number of those cells assigned by souporcell to each animal is plotted. Right: of cells known to originate from each animal, souporcell assignments are shown as a percentage of total cells assigned to that animal. (D, E, F) Same as (A, B, C) but for African green monkey (n = 5) scRNA-seq data.
We next decided to in silico benchmark these SNP demultiplexers on another potentially inbred population, the African green monkey. The African green monkey is a pre-clinically relevant species, with a published genome (Warren et al, 2015) and SNP variant VCF file available for use (Huang et al, 2015). For this experiment, five individually sequenced green monkey scRNA-seq datasets were filtered for high-quality cells, pooled in silico, and subsequently demultiplexed (Fig 1D) (Speranza et al, 2021). Our results from this demultiplexing closely mirrored that of the zebrafish dataset in terms of accuracy and the similarity of results from independent SNP-based demultiplexing programs (Fig S3). Compared with demultiplexing of three zebrafish samples, the doublet detection observed with this pool of five monkey samples was improved (82% versus 55%) (Fig 1E and F). We thus tested whether the number of samples pooled contributes to doublet detection efficiency. We pooled three of the same monkey samples and SNP demultiplexed them using souporcell. We found that souporcell only called 66% of true heterotypic doublets from this pool of three monkey samples ( Fig S4). This suggests that the number of samples in a pool contributes to efficiency of heterotypic doublet detection. Overall, these results indicate that SNP-based demultiplexing can be a highly accurate and efficient method for demultiplexing pooled single-cell data from non-human primates.
Finally, we also assessed results of SNP-based demultiplexing of synthetically pooled single-nuclei data from axolotl. The axolotl is an example of the type of organism for which scRNA-seq has enabled cell level study of regeneration and immunology for the first time (Gerber et al, 2018;Leigh et al, 2018;Rodgers et al, 2020;Lin et al, 2021;Lust et al, 2022;Ye et al, 2022). A genome Smith et al, 2019) and SNP variant VCF file (Timoshevskaya et al, 2021) are available for the axolotl, but its large genome provides a distinct challenge when using computational tools. We used in silico pooling for three individually sequenced axolotl single-nuclei RNA-seq datasets ( Fig S5A) (Lust et al, 2022). 95-99% of the cells from each individual axolotl dataset were correctly demultiplexed from the pool (Fig S5B and C). Similar to the three sample zebrafish datasets, we found that souporcell failed to identify a large proportion of the synthetic heterotypic doublets ( Fig  S5C). These results suggest that SNP-based demultiplexing may accurately demultiplex pooled samples in any single-cell compatible organism with genetic heterogeneity.

Identifying the limits of SNP-based demultiplexing: inbreeding and SNPs density
With the promising results observed in zebrafish, African green monkey, and axolotl, we next investigated the minimum level of animal genetic variation required to enable SNP-based demultiplexing. To do this, we used previously published data from highly inbred isogenic mice (Crowl et al, 2022). Our efforts to apply SNPbased demultiplexing to an in silico-pooled mixture of three C57BL/ 6 mice samples (Crowl et al, 2022) were unsuccessful, with 99.7% of cells being unassigned by souporcell or Vireo (data not shown). C57BL/6 mice are nearly genetically identical with only~15,000 SNPs and insertion-deletion mutations (indels) (Doran et al, 2016), providing evidence that SNP-based demuxers are unable to separate biological replicates from highly inbred mice. We also attempted SNP-based demultiplexing of in silico-pooled data from DBA/1J mice (Shinozaki et al, 2022) which have~5 × 10 6 total SNPs and indels (Doran et al, 2016). Again, 99.7% of cells were unable to be assigned via souporcell (data not shown), suggesting that SNPbased demultiplexing is incompatible with highly inbred animals. These findings are in line with other claims that SNP-based demultiplexers are unable to separate samples within the same mouse strain (Mylka et al, 2022). Although the vast majority of murine work is performed on a single strain, pooling different strains reportedly does allow for SNP-based demultiplexing (Mylka et al, 2022).
To try to determine SNP frequency thresholds required for successful SNP-based demultiplexing, we investigated the density of SNPs in all the datasets used thus far. We found that both inbred mouse strains had less than 0.2 SNPs/kilobase (kb) (0.013 and 0.19 for C57BL/6 and DBA1/J, respectively). In comparison to the inbred mouse strains, the SNP density of the successfully demultiplexed datasets was higher at 0.34/kb (axolotl), 0.86/kb (African green monkey), and 3.57/kb (zebrafish). Although an exact quantitative analysis of this possible genetic cutoff would be useful, these results imply that a range of 0.2-0.34 SNPs/kb may be the minimum required within a sc-seq dataset for SNP-based demultiplexing.

SNP-based demultiplexing of high sample number pools
Our previous results with in silico-pooled scRNA-seq data indicated high accuracy of SNP-based demultiplexers, but this setup lacks challenges of physically pooled scRNA-seq data, like ambient RNA and real heterotypic doublets. Therefore, we were interested in benchmarking SNP-based demultiplexing accuracy in a realistic and challenging experimental scenario: experimentally pooled cells from siblings, with a high sample number and without a common SNPs VCF file. We analyzed a published dataset of Xenopus laevis scRNA-seq data containing eight experimentally pooled samples from three Xenopus transgenic lines that each overexpress a different fluorescent gene (Lin et al, 2021) (Fig 2A). Xenopus is another common laboratory animal that has a published highquality genome (Session et al, 2016).
To identify the transgenic line origin based on fluorescent mRNA counts, we co-opted MULTIseqDemux (McGinnis et al, 2019) to assign donor identities based on the transgenic line-expressed fluorescent gene counts. Although this approach succeeded in assigning the Xenopus transgenic line of origin, the low number of fluorescent gene counts left many cells without sufficient data to make an assignment prediction (Figs 2B and S6). This low detection of fluorescent gene transcripts is a common problem when using fluorescent marker genes for demuxing pooled data (Lin et al, 2021) and underscores the importance of alternative means to demultiplex pooled data. To avoid benchmarking results with low-quality cells, stringent filtering was used to ensure that only cells likely to have accurate origin assignment by the fluorescent gene-based demultiplexing method would be used to benchmark SNP-based demultiplexing.
We first assessed the Xenopus data for correlation between fluorescent and SNP-based cell assignments. We observed a remarkable similarity in demultiplexing the eight-animal dataset with these two orthologous methods (Fig 2C-E). To further quantify the (B) Cell identification percentage (Fluor Cell ID %) by fluorescent-based assignment is plotted against average read depth. All cells were sorted by read depth and binned into 40 groups before calculating Fluor Cell ID%, and average total, and fluorescent read depth. Fluor Cell ID % is defined as the percentage of cells in each bin that were assigned to any of the three transgenic animal identities by fluorescent-based demultiplexing analysis. Binned data are colored by the average number of summed fluorescent reads per cell. Subsequent analysis plots focused on cells with between 5,000 and 40,000 mapped reads, and >0 summed fluorescent gene reads. (C) UMAP plot of Xenopus-pooled scRNA-seq data colored by fluorescent-based cell assignments. (D) UMAP plot of Xenopus-pooled scRNA-seq data colored by souporcell assignments relabelled according to correlating transgenic animal name. (E) Upset plot comparing cell assignments by souporcell to fluorescent based assignments. (F) Bar plots quantifying the distribution of souporcell assignments for cells from each animal. Left: of cells assigned to each transgenic line through fluorescent-based assignments, the number of those cells assigned by souporcell to each category is plotted. Right: of cells assigned to each transgenic line through fluorescent-based assignments, souporcell assignments are shown as a percentage of total cells in that category.
souporcell assignment accuracy, we evaluated how souporcell performed in comparison to transgenic line assignments made by fluorescent-based assignment ( Fig 2F). We found high agreement between the two methods, as demonstrated by the souporcell assignment agreement on 86-92% of cells in each sample ( Fig 2F). Furthermore, many cells were assigned as "negative" by the fluorescent-based approach, which could then be computationally rescued and assigned a transgenic line by souporcell. This is particularly notable for the mCherry-expressing cells, which were difficult to assign by the fluorescent-based assignment method ( Fig  2C and D). Also of interest, we found that one of the eight transgenic animals was almost completely absent from the dataset, only displaying dozens of total cells. Although individual animal assignments by souporcell could not be validated, this suggests animal dropout in this dataset that would not be identified without demultiplexing.
As with the in silico experiment, doublet assignments were the biggest discrepancy. It is not clear from the available data whether cells are true doublets being identified only by souporcell, or if the SNP-based demuxer is over-assigning doublets. However, assuming the worst-case scenario: that souporcell is over-assigning doublets, this is a relatively harmless mistake considering that a standard analysis would subsequently remove these doublets from downstream analysis. Overall, these results display a high degree of accuracy for SNP-based assignments in a complex, experimentally pooled mixture of Xenopus cells.
Next, we sought to evaluate whether souporcell could produce assignments in a previously published dataset of a pool of 30 zebrafish embryos (Metikala et al, 2021). We found this dataset to contain 25,279 cells, which according to the manufacturer would result in~20% of droplets being doublets. This would be considered a superloaded sample as it is over the supported cell recovery recommendations. However, this type of experiment is common and would be benefit from SNP-based demultiplexing. We applied souporcell to this dataset and found that it gave assignments for 95.1% of cells (24,047 of 25,279). Although this experiment does not have a ground truth to compare with, we found that the doublet rate reported by souporcell was 18.6%, close to the expected 20%. All 30 embryos were identified in this final dataset, ranging from 275 to 1,517 cells per individual zebrafish (Table S2 and Fig S7A). At the cluster level, there was wide variation in assigned doublets ranging from 0.8 to 30.2% of cells within a cluster ( Fig S7B). When pooling samples, a common assumption is that each cluster is composed of cells from all biological replicates. This assumption held true for nine of the identified 21 clusters in this sample, predominantly those with the most cells. The other 12 clusters were composed of cells from less than 30 replicates (animals), breaking the common assumption and suggesting a possible source of artifacts in interpreting results from these clusters ( Fig S7B). This analysis could not be validated because this dataset did not contain a distinct cell-labeling method to enable orthogonal non-SNP-based demultiplexing and thereby a ground truth for comparison. Two pieces of evidence point towards the reliability of these results (1) the high accuracy of assignments from all the in silico pooled datasets and (2) that souporcell fails to provide assignments in the inbred mouse datasets (see previous subheading). Altogether, this suggests that souporcell may be a viable means to identify biological replicates in large pools, which will provide informative metadata and enhance analysis for future experiments.
Successful SNP-based demuxing of pooled fluorescent Pleurodeles samples without a genome or previous SNP information After observing that SNP-based demultiplexing was reliable on various commonly used model species with genomes and with or without common SNP VCF files, we wanted to test the limits of what resources are required for accurate SNP-based demultiplexing. To do this, we took the current minimum adequate reference to map single-cell sequencing data, a de novo transcriptome, and did not use a common SNP VCF files. The Spanish ribbed newt, Pleurodeles waltl, is ar reemerging regenerative model organism, which has a high-quality de novo transcriptome (Matsunami et al, 2019) and more recently a high-quality genome (Brown et al, 2022, Preprint) but no available common SNPs VCF file. We first set out to assess souporcell demux assignments on pooled splenocytes from three transgenic Pleurodeles newts, which express different fluorescent proteins under the same ubiquitous promoter (CAG) (Joven et al, 2018;Eroglu et al, 2022). We designed this experiment to only contain non-erythroid spleen cells from one individual of each transgenic newt line, making it technically feasible to benchmark souporcell cell assignments for individuals through comparisons to fluorescent-based assignments ( Fig 3A).
As performed in the Xenopus analysis, we selected only cells that had sufficient read depth and fluorescent gene detection for benchmarking (Figs 3B and S8A-H). We found that fluorescentbased and souporcell assignments show a high degree of correlation ( Fig 3C and D). The fluorescent-based approach assigns many cells as "negative" (63% of cells pre-filter, 29% of cells after filtering) (Figs 3C and S8H). In this dataset, we observed that cells sequenced to a higher depth showed higher agreement between souporcell cell assignments and the fluorescent-based ground truth, which is expected because both methods should improve with more information ( Fig S8I). Furthermore, though these samples were pooled and sequenced as one sample, we find dramatic variation between individuals in the Pleurodeles splenocyte data (i.e., clusters derived from primarily one animal). The heterogeneity of sample representation in different cell clusters highlights the critical need for demultiplexing of pooled scRNA-seq data (Fig S8A-C, J, and K). Without demultiplexing, erroneous conclusions on novel cell states or types may arise.
We found a high degree of agreement between fluorescentbased cell assignments and SNP-based assignments from Pleurodeles scRNA-seq data ( Fig 3E). Of cells assigned by fluorescentbased demuxing to one of the three transgenic animals, 78-93% of those cells were correctly identified by souporcell (Fig 3F). When the two methods disagree, the most prevalent occurrence is "negative" fluorescent-based cell assignments that souporcell assigned to one of the transgenic animals (i.e., the rescue we also found in the Xenopus data). Similar to the Xenopus dataset, we attribute these fluorescence-based "negative" assignments to the low capture of fluorescent gene reads (Figs 3B and C and S8A and C). The next most common discrepancy between the two demultiplexing approaches  Cell identification percentage (CMO Cell ID %) by CMO-based assignments is plotted against average read depth. All cells were sorted by read depth, and binned into 40 groups before calculating CMO Cell ID %, and average total read depth per cell. CMO Cell ID % is defined as the percentage of cells in each bin that were any of the three CMO groups via CMO analysis. Subsequent analysis plots focused on high accuracy cells with between 5,000 and 40,000 mapped reads. (C) UMAP plot of Pleurodeles and Notophthalmus pooled scRNA-seq data colored by CMO assignments. (D) UMAP plot of Pleurodeles and Notophthalmus pooled scRNA-seq data colored by souporcell assignments relabelled according to correlating CMO labels. (E) Upset plot were contested doublets, cells labeled as "doublet" by fluorescentbased demuxing, for which souporcell disagrees. The low number of mapped fluorescent reads in these samples means that a "doublet" assignment by fluorescent-based demuxing likely indicates that a particular cell had one to two counts of two distinct fluorescent genes. This could be indicative of a doublet, but it is also possible that these are erroneous doublet assignments because the fluorescent-based doublet assignments too often rely on sparse data.
SNP-based demultiplexing succeeds in two species, pooled single-cell RNA-seq as shown by benchmarking against lipid hashing We next wanted to determine whether SNP-based demultiplexers could succeed in demuxing scRNA-seq datasets containing cells pooled from multiple species. This multi-species approach would be particularly useful for cross-species analyses and ecological studies. This approach capitalizes on the so-called "barnyard" approach, that is typically used when investigating doublet rates in new single-cell technologies. For this experiment, we pooled splenocytes from two P. waltl and two Notophthalmus viridescens ( Fig S9A). Notophthalmus is another salamander species studied for its regenerative capacity but equipped with only a de novo transcriptome and no published common SNP VCF file. To investigate SNP-based assignments and doublets in this dataset, we applied souporcell and a barnyard style analysis to this dataset and found a general agreement between these two approaches ( Fig  S9A-G).
Within this experiment of dual-species pooled 10x Genomics scRNA-seq libraries, distinct lipid hashing CMO labels were applied to label the origin of each of the Pleurodeles samples, and one hashing label was used for the pooled Notophthalmus samples, totaling three hashing labels ( Fig 4A). This approach provided a means to benchmark souporcell against the current best practice and the only available commercial multiplexing strategy for cells from species without specific hashing antibodies. For the analysis, sample identity determined by applying MULTIseqDemux to CMO data was used to evaluate SNP-based demultiplexing results. We constructed and aligned to a dual-species index made from the SuperTranscriptome (Abdullayev et al, 2013;Davidson et al, 2017;Matsunami et al, 2019) for each of these species. We removed low quality, low-depth cells for further benchmarking analysis and identified which souporcell individual assignments corresponded to each CMO through analyzing the correlation of assignments between the two methods ( Fig 4B). Interestingly, we did see that cells with more biological read depth showed improved percent agreement between souporcell-and Cellplex-based assignments, which was not a given because the Cellplex label is a distinct library ( Fig S10). This suggests that SNP-based demuxers do display a performance relationship with increased read depth. One important note, it appeared that many cells with high CMO reads per cell have relatively low biological reads per cell (Fig 4B), suggesting that although these cells can be CMO demultiplexed they would be unlikely to provide adequate biological information.
We found that souporcell sample assignments correlated tightly with CMO-based assignments (Fig 4C-E). When CMO-based assignments were interpreted as ground truth, the accuracy of souporcell assignments was clear, with 73-96% agreement between the two methods for assignment to the three CMO-based assignments (Fig 4F). The largest difference in assignments between these methods was that souporcell identified many doublets in the data from one CMO-labeled sample. SNP-based demultiplexers can only detect doublets from two different individuals (i.e., heterotopic doublets) which occurs at increasing ratios as sample number in the pool increases (e.g., three samples = 66% heterotypic doublets, four samples 75% heterotypic doublets, etc.). We thus also tested a second means to detect doublets, transcriptional-based doublet detection using scds to determine if this could compensate for missed doublets assignments (Bais & Kostka, 2020). We found that scds assigned many doublets not assigned by souporcell and vice versa ( Fig S11). Overall, souporcell and scds agreed on 68% of assignments, including singlets and doublets. In relation to Cellplex-based assignments, both computational approaches are in high agreement, but assign doublet identities to some Cellplexbased singlet assignments. This suggests that transcriptionalbased doublet detection potentially identifies doublets missed by souporcell and wet laboratory-based approaches. In general, these analyses demonstrate high agreement between CMO-based and SNP-based demultiplexing of this data from a notably complex experiment using species with poor genomic resources. Altogether, these data suggest that SNP-based demultiplexing is a powerful and near universal method to demultiplex pooled samples.

Discussion
The commercial appearance of single-cell sequencing technologies has enabled the study of complex tissues from any species. The technical and financial hurdles posed by these technologies can discourage their use, especially when biological replicates are needed to produce reliable datasets. In the coming years, we expect that similar to bulk RNA sequencing (Schurch et al, 2016), increased biological replicates should be prioritized over increased information on one sample in sc-seq studies. We demonstrate that sample pooling is a useful and viable tool for empowering singlecell datasets. It is critical that pooling and demultiplexing approaches are also species agnostic, as one of the major advantages of scRNA-seq is its broad applicability in most organisms.
The results presented in this study are in strong support of using sample pooling and SNP-based demultiplexers in scRNA-seq studies in species which possesses between individual genetic variability. We show with numerous species which have varying comparing cell assignments by souporcell to CMO-based assignments. (F) Bar plots quantifying the distribution of souporcell assignments for cells from each CMO group. Left: of cells assigned to each CMO group, the number of those cells assigned by souporcell to each identity is plotted. Right: of cells assigned to each CMO group, souporcell assignments are shown as a percentage of total cells in that category. genomic resources and genetic variability that SNP-based demultiplexing produces highly accurate demuxing results. We used a diverse swath of methods, each with unique strengths, to validate the widespread use of SNP-based demultiplexing. We attempted to estimate the SNP density required for successfully demultiplexing, but recommend in other species a pilot experiment in which a secondary demultiplexing method, or in silico pooling, is used to validate SNP-based demux results. This will also allow for the fine tuning of parameters to ensure trustworthy results. For example, testing out a low-quality common SNPs VCF from a more obscure model organism versus a sample-derived VCF may demonstrate that the sample-derived VCF performs better.
It will also be important to define the upper limit for the number of distinct pooled samples followed by SNP-based demuxing for each organism. In line with this, we obtained assignments when performing SNP-based demultiplexing on a pool of 30 zebrafish samples. However, without a benchmarking assessment from a ground truth derived from a distinct technology, it is uncertain if these results can be trusted. We therefore propose that using SNPbased demultiplexers on large pools needs to be further validated. This can be performed in silico as more single-replicate scRNA-seq datasets become published. Until then, the developers of souporcell indicate that 21 pooled human samples can be demultiplexed and speculate that this could work in up to 40 (https:// github.com/wheaton5/souporcell/issues/30). It is more likely that these large pool experiments occur in non-human samples. Thus, to aid future validations, we provide examples and modified memory-efficient scripts to pool samples and determine accuracy which will aid laboratories' working in any species to conduct their own benchmarking. In addition, this memory-efficient script allows for pooling without a VCF file, which will be critical for all researchers interested in benchmarking SNP demuxers in organisms without VCF files available. Once upper limits become established, another option to increase throughput would be layered multiplexing, for example, labeling 10 individuals with a CMO, another 10 with a second CMO, and so on. This could be paired with a second SNP-based demultiplexing step and could substantially expand sample throughput.
The ability to dramatically increase biological replicate throughput will have important implications for the quality of scRNA-seq data. Animal-to-animal variability in gene expression patterns has been observed for decades and is expected in many tissues, especially in immune cells, of animals that encounter different stressors or pathogens (Boedigheimer et al, 2008;Schokker et al, 2015;Paunovska et al, 2018;Reid et al, 2018). The existence of this natural biological variability coupled with technical variability in single-cell methods poses problems for studies without replicates or those that do not demultiplex pooled datasets (Hicks et al, 2018). Demultiplexing of pooled scRNA-seq is required to associate metadata with individual samples to account for between sample variation, replicate strengthened differential expression analysis, and confirmation of sample representation across scRNA-seq datasets and cell clusters (Squair et al, 2021;Zimmerman et al, 2021). The potential for heterogeneity in sample representation across cell clusters can be clearly seen in Figs 3 and S8, where some cell clusters are from one or two of the three individuals present in the pool. Although a variety of batch correction algorithms are available for single-cell studies (Tran et al, 2020), these cannot be used if there are no metadata distinguishing replicates. We recommend that SNP-based demultiplexing become a standard quality control for all pooled samples from animals with between animal genetic diversity.
An additional benefit of pooled single-cell experiments is the superloading of cells followed by heterotypic doublet detection. However, our results from synthetic pooled samples saw varied success in heterotypic doublet detection by SNP-based demultiplexers. Our analyses also indicate that sample number in each pool may directly impact doublet-detection efficiency, which should be considered in experimental design. We argue that the most pragmatic approach for doublet detection includes a combination of SNP-based and transcriptomic-based methods. In line with this, a thorough benchmarking in human samples of transcriptional and SNP-based doublet detectors suggest that an intersectional approach to doublet detection is superior to any one single method (Neavin et al, 2022 Preprint). Multiple independent programs designed specifically for doublet detection using transcriptomic instead of genotypic information are available for scRNA-seq data including DoubletDetection (Gayoso & Shor, 2022), Solo (Bernstein et al, 2020), and Scds (Bais & Kostka, 2020). To gain the full benefits of SNP-based demultiplexing with robust transcriptomic doublet detection, we recommend using one or more of these independent doublet-detection programs to identify and remove cell barcodes likely to contain doublets. We anticipate that optimized doublet detection algorithms along with improved bioinformatic resources for each species would improve doublet detection and facilitate superloading pooled single-cell data.
Many bioinformatic tools for scRNA-seq are not easily applied to all species. We intentionally applied a SNP-based demuxer to experimental set ups that would traditionally be expected to be computationally challenging, demonstrated the successful application, and provide a table and flowchart to illustrate the workarounds used in each case (Figs S2 and S12). Fortunately, we demonstrate that SNP demuxing can be performed without resources like a high quality, well annotated genome, or a populationwide common SNP genotypes VCF file. This is where souporcell stood out as the only tool that was feasible to use with only a de novo transcriptome and no VCF input. A secondary hurdle for applying SNP-based demuxers is that these tools often struggle when applied to datasets from species with large reference genomes or de novo transcriptomes (which typically have many contigs). When using de novo transcriptomes (i.e., Pleurodeles or Notophthalmus), we modified the default souporcell pipeline to enable remapping of reads (Fig S12, see the Materials and Methods section). Finally, though pooling of multiple species into one experiment is not currently commonplace, we demonstrated the successful SNP-based demuxing of a pooled two-species experiment. We expect that variations in cell or nuclei sizes between species could cause biases in cell capture depending on the scRNA-seq library preparation method, especially with microfluidic scRNA-seq. Despite these potential challenges, this experiment demonstrates the success of SNP-based demuxing in a particularly challenging scenario, and as more research focuses on cross-species comparisons, this is a clever means to increase sample throughput.
Overall, we successfully applied SNP-based methods to demultiplex pooled single-cell data from multiple species and a two species mix. Our benchmarking results suggest that SNP-based demultiplexing in these species is accurate relative to other available demultiplexing approaches. We hope that this study will increase awareness of single-cell pooling and SNP-based demultiplexing approaches for research communities not yet using these methods. Including SNP-based demuxers in experimental designs for future (and past) studies will greatly expand single-cell-based discoveries. This will facilitate work in well-known and lesser studied species by lowering the financial and technical hurdles of producing adequately powered single-cell experiments. We predict that both species agnostic and cross-species comparative studies are going to be increasingly fruitful in uncovering biological insights and the application of SNP-based demultiplexing with minimal genomic resources is critical for future research.

Animal handling and ethics
All experiments were carried out in post-metamorphic P. waltl and N. viridescens at Karolinska Institutet and were performed according to local and European ethical permits. N. viridescens and P. waltl were raised in-house. All animals were maintained under standard conditions of 12-h light/12-h darkness at 18-24°C (Joven et al, 2015). Before all experiments, animals were fully narcotized in 0.1% tricaine in housing water. For Pleurodeles in the fluorescent pool, experimental animals were housed in carbon-supplemented filtered tap water (55 g Tetra Marine Sea Salt, 11 g Ektozon N Salt, 2.5 ml of water conditioner/dechlorinator (Seachem Prime -Vattenberedningsmedel), 20 ml of Yokuchi Bitamin Multivitamin and 10 ml of calcium supplementation (Easy-Life Calcium) into 100 Liters of water). For Pleurodeles and Notophthalmus in the Cellplex experiment, animals were housed in the water as described but modified to only have sea salt, Ektozon, and calcium solution.

Collection of splenocytes for scRNA-seq
Fluorescence-pooling experiment Spleens were harvested from three separate P. waltl and processed as individual samples in parallel. All animals were postmetamorphic newts from established transgenic lines close to sexual maturity: one female tgTol2(CAG:Nucbow CAG:Cytbow) Simon (5.67 g weight, 10.8 cm snout-to-tail length) (Joven et al, 2018), one male tgSceI(CAG:loxP-GFP-loxP-Cherry) Simon (5.36 g, 11.1 cm) (Joven et al, 2018), and female tgTol2(CAG:loxP-Cherry-loxP-H2B:YFP) Simon (6.25 g, 10.8 cm) (Eroglu et al, 2022). Forceps and iridectomy scissors were used to remove the spleen in one piece, making sure that the forceps did not tear the spleen. Iridectomy scissors were used to carefully remove connective tissue. A 70-μm nylon mesh filter was inserted into a 50 ml conical, and 1 ml of ice cold 0.7X PBS was added to pre-wet the filter. The spleen was placed on the prewetted filter and slowly mashed through the filter using the plunger stopper end of a 3-ml syringe. Once the spleen appeared translucent, the plunger and filter were thoroughly washed with 10 ml ice cold 0.7X PBS and making sure no PBS was left on the plunger or filter. The 0.7X PBS solution was then poured into a 15 ml conical and centrifuged at 300g for 5 min at 4°C in a swinging bucket rotor. Supernatant was decanted, and cells were resuspended in 1 ml of 0.7X PBS. Cells were then counted using trypan blue to assess viability. Viabilities of non-erythroid cells (based on cellular morphology) were 85%, 98.4%, and 91.6% in eBFP, eGFP, and mCherry animals, respectively. FACS was used to sort for fluorescent-positive cells (Fig S13A-D). The samples have been analyzed without any removal of red blood cells using a BD Influx cytometer; the selected cells were sorted using a 100-μm nozzle in bulk into 1.5 ml Eppendorf tubes. The cell preparations and the sorted cells were kept in 4°C throughout the sorting. The results were analyzed with FlowJo software 10.8.1.
Debris and erythrocytes (note: erythrocytes do not express fluorescent markers under the CAG promoter and are far larger than other splenocytes) ( Fig S13E) were excluded with the gating strategy in the side scatter-forward scatter and singlet discrimination plots revealing separated fluorescent subpopulations of GFP, mCherry, and BFP as confirmed by sorting on microscopy slides (Figs S13 and S14). We sorted the subpopulations with the highest expression levels of each fluorescent tag. In total, 4 × 10 5 cells of GFP+, 4 × 10 5 of mCherry+, and 3.15 × 10 5 BFP+ were isolated in individual 1.5 ml Eppendorf tubes. GFP and mCherry expression were high, but BFP expression was dim. 500 μl of each solution was then added to an individual 1.5 ml Eppendorf tube for the fluorescent pool sample (trial mixture of pooled cells shown Fig S13D). This was centrifuged in a 1.5 ml Eppendorf at 4°C at 300g on a tabletop centrifuge. The supernatant was carefully removed and resuspended in the remaining volume. Cells were then manually counted and adjusted to a concentration targeting the collection of 10,000 cells on the 10x Genomics controller.
Cellplex/barnyard-pooling experiment Spleens from one female N. viridescens (4.45 g and 10.6 cm snoutto-tail) and one male N. viridescens (3.55 g and 10.2 cm) were collected, pooled into one tube, and then processed as described in "fluorescence-pooling experiment" excluding FACS cell sorting. The only modification to the above described processing is that cells were kept at room temperature throughout and that cells were resuspended in 0.7X PBS with 0.04% ultrapure BSA.
For P. waltl, spleens were removed from animals as described in "fluorescent-pooling experiment" from one adult tgSceI(CAG:loxP-GFP-loxP-Cherry) Simon female (23.5 g and 16.1 cm snout-to-tail length), one male tgSceI(CAG:loxP-GFP-loxP-Cherry) Simon (13.95 g and 15.7 cm) animal. Pleurodeles were processed as individual samples. After the spleen was thoroughly mashed through the prewetted 70-μm nylon filter and the filter being washed with 10 ml of 0.7X PBS, the cells were centrifuged at 300g for 5 min. Splenocytes were resuspended cells in 1 ml of sterile filtered 1X ACK (http:// cshprotocols.cshlp.org/content/2014/11/pdb.rec083295.short) to lyse red blood cells. After one minute of lysis, cells were diluted with 10 ml of 0.7X PBS and filtered through a 70-μm nylon mesh filter and centrifuged at 300g for 5 min. Cells were then resuspended in 0.7X PBS with 0.04% ultrapure BSA.
The pooled Notophthalmus sample and the individual Pleurodeles samples were then taken through the 10x Genomics 39 Cellplex-labeling protocol (CG000391; Demonstrated Protocol) the only modifications being the use of 0.7X PBS + 0.04% BSA for all wash and resuspension steps. Samples were stained with CM304 (Pleurodeles female), CMO305 (Pleurodeles male), and CMO306 (pool of Notophthalmus samples, one male and one female). Samples were manually counted and pooled at equal ratios immediately before loading onto the 10x Genomics Chromium Controller targeting 9,000 cells in total.

Preparation and sequencing of single-cell RNA sequencing libraries
Chromium single-cell 39 kit v3 (10x Genomics) was used according to the manufacturer's instructions.
The resulting cdhit.clstr file was then parsed using clstr2txt.pl from the cdhit package to generate a clusters.txt file. The id and clstr columns were obtained from this file and used to make an info.clusters.txt file to use with Lace (Davidson et al, 2017). Lace v1.14.1 was run: Lace_run.py reference_transcripts_v2.fa info.clusters.txt -t -cores 16 -o Noto_superTrans. The resulting gff files from both SuperTranscriptomes were converted to a gtf using AGAT perl script agat_convert_sp_gff2gtf.pl (Dainat et al, 2022).

Green monkey (Chlorocebus aethiops) in silico mixing and demultiplexing
Five samples were downloaded from a previously published dataset (Speranza et al, 2021). Data were downloaded from SRA using prefetch followed by fasterq-dump with flags split-files and include-technical. FASTQ files were obtained from AGM1_Mediastinal Lymph Node (SRR12507774-SRR12507781), AGM3_Mediastinal Lymph Node (SRR12507790-SRR12507797), AGM5_Mediastinal Lymph Node (SRR12507806-SRR12507813), AGM7_Mediastinal Lymph Node (SRR12507822-SRR12507829), and AGM9_Mediastinal Lymph Node (SRR12507846-SRR12507853). C. aethiops has a robust VCF file available (European Variation Archive: PRJEB7923) that needs to be used in conjunction with genome assembly Chlorocebus_sabeus 1.1 (GCA_000409795.2). This assembly did not have an annotation file available, and we generated a gtf file for this GenBank assembly using minimap2 (Li, 2018) (Danecek et al, 2021) to filter, sort, and change chromosomes name in the VCF: bcftools filter -include "MAF ≥ 0.05" -Oz -output Vervet.  And then converted chromosome names in this VCF to match the chromosomes in the bam files after mapping. chr_name_conv.txt has the format of "1 CM001941.2," with 1 being the original chromosome number and CM001941.2 being the chromosome accession number; this was carried out from chromosome 1-29.

Pooled Pleurodeles and Notophthalmus mapping and SNP-based demultiplexing
A dual species Cell Ranger 7.0.0 reference was made using the SuperTranscriptomes and corresponding gtf files (described above) from the two species using Cell Ranger mkref command. The two species, four animal, pooled scRNA-seq from Pleurodeles and Notophthalmus dataset was then mapped to this dual species index using Cell Ranger 7.0.0 count command. In addition, the Cell Ranger 7.0.0 multi-command was used to assess multiplexing Cellplex information for all cells in the same dataset. For the Cell Ranger multi-command, the following flags were used: (expect cells 10,000, min-assignmentconfidence 0.6). Souporcell demultiplexing was run identically to above on the Pleurodeles only samples but with N = 4, and the relevant FASTQ and dual species reference transcriptome FASTA files.
bcftools annotate -rename-chrs Conv.chr mgp_REL2021_snps.vcf.gz | bgzip > rename.mgp_REL2021_snps.vcf.gz To pool bams, we modified the original synth_pool.py script from Vireo's GitHub repository which was memory intensive and only allowed for pooling in the presence of a VCF file (https:// zenodo.org/record/7929057). To enable more widespread use, we now stabilized memory use throughout the pooling and added an option (-noregionFile) which can pool in the absence of a VCF file. This means that species that do not possess a VCF file can still do the in silico ground truth benchmarking we performed in this study. A pull request has been initiated to propagate the changes to Vireo's main GitHub repository (https://github.com/singlecell-genetics/vireo/pull/81). The modified version was used to pool the mouse data below. Note: if the pull request is accepted then calling the synth_pool.py script from Vireo's GitHub repository will in fact be this modified script with new options added. This script can produce identical pooled BAM files to the original.

Souporcell analysis details summary
A summary figure reviews the computational details used to run souporcell on the above datasets ( Fig S12). Overall, the standard souporcell default pipeline was used initially, but additional flags or pieces of the souporcell pipeline were run externally when this failed because of memory constraints or other problems. All SNP demuxers were run from a Demuxafy (Neavin et al, 2022 Preprint) singularity container (image version 1.0.3). SNP numbers in each VCF were counted using: grep "##" VCFname | wc -l.
For analyses of Pleurodeles and dual species datasets, the first two steps of the souporcell pipeline were run separately and then the output from these was introduced back into the souporcell pipeline for completion ( Fig S12). This allowed us to adjust computational parameters in the remapping stage that permitted the function to finish.
Analysis and benchmarking of souporcell assignments: R analysis A two part analysis in R and then Python was used to evaluate the efficacy of souporcell demultiplexing for each dataset. Scripts in R (version 4.1.2) primarily using Seurat (version 4.1.0) (Hao et al, 2021) were used to evaluate souporcell cell assignments found in the clusters.tsv file through comparison with known cell identities or cell assignments based on fluorescence, barnyard analysis, or CMO labeling. Full R scripts for all analyses are deposited in the GitHub page for clarity (https://github.com/RegenImm-Lab/SNPdemuxPaper). Seurat was used to import and analyze single-cell gene expression data for all datasets and to analyze the multiplexing capture data for the dualspecies Cellplex (CMO)-labeled dataset.

Cell filtering
For UMAP plots and all bar plots in benchmarking analysis, cells for all experimentally pooled datasets were filtered before analysis to select for cells most likely to have accurate calls by the respective benchmarking assay. Cell Ranger default filtered cells were parsed to select cells with between 5,000 and 40,000 mapped reads for each dataset. In addition, for Figs 2 and 3, we removed cells lacking counts of any of the three fluorescent mRNAs. Cell assignments based on fluorescence or Cellplex labeling were made through the analysis of fluorescent reads or Cellplex CMO reads by the Seurat MULTIseqDemux function (autoThresh = T) (McGinnis et al, 2019). Although the MULTIseqDemux function is written to assign cell identities based on CMO labels, we found that it works well with data from overexpressed fluorescent gene mRNAs. For analysis purposes, souporcell demultiplexing numerical cell labels were adjusted to relevant sample names based on correlation analyses to each relevant benchmarking demultiplexing method. UMAP (McInnes et al, 2018 Preprint) and upset plots (Lex et al, 2014) were generated in R scripts and annotated in Affinity Designer. A dataframe with compiled cell assignment information was exported to Python for further analysis.
Google Colab links for analyses are located here: